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6JQ Abstract 

Scene understanding remains a significant challenge in the computer vision community. The visual 
psychophysics literature has demonstrated the importance of interdependence among parts of the scene. 
Yet, the majority of methods in computer vision remain local. Pictorial structures have arisen as a funda- 
mental parts-based model for some vision problems, such as articulated object detection. However, the 
form of classical pictorial structures limits their applicability for global problems, such as semantic pixel 
labeling. In this paper, we propose an extension of the pictorial structures approach, called pixel-support 
parts-sparse pictorial structures, or PS3, to overcome this limitation. Our model extends the classical 
iy5 form in two ways: first, it defines parts directly based on pixel-support rather than in a parametric form, 

and second, it specifies a space of plausible parts-based scene models and permits one to be used for 
inference on any given image. PS 3 makes strides toward unifying object-level and pixel-level modeling 
t-H of scene elements. In this report, we implement the first half of our model and rely upon external knowl- 

edge to provide an initial graph structure for a given image. Our experimental results on benchmark 
datasets demonstrate the capability of this new parts-based view of scene modeling. 

o 

^t" 1 Introduction 

00 

We consider the semantic pixel labeling problem: given a set of semantic classes, such as tree, cow, etc., 
the task is to associate a label with every pixel. Although hotly studied in recent years, semantic labeling 
remains a critical challenge in the broader image understanding community, for obvious reasons like high 
intraclass variability, occlusion, etc. Early approaches have relied on texture clustering and segmentation, 
e.g., [Carson et al.[|2002 1. More recently, conditional random fields have become the de facto representation 



for the problem, e.g., [Shotton et al.[|2009] l. Most such methods learn a strong classifier based on local 



patches or superpixels and specify some form of a smoothness prior over the field. 



Although these methods have demonstrated good success on challenging real world datasets [Sh otton et al. 
|2009[ , their performance remains limited for one key reason: they are intrinsically local making it difficult 
to incorporate any notion of object and even region semantics. Yet, the visual psychophysics literature has 
demonstrated the clear importance of modeling at the object- and inter-object relational level for full scene 



understanding [Biederman 1981 1. Although there has been some work on overcoming the challenge of 
locality (see next section on Related Work for a brief survey), there has been little work in the direction of 
incorporating a notion of scene parts nor the inter-relationship among the parts. 

In contrast, we present a parts-based approach to full semantic pixel labeling that marries an object-level 
model of the parts of the scene with a pixel-level representation, rather than a strictly pixel- or region-level 



model. Our method sits in the broad class of pictorial structures [Fischler and Elschlager 



shown notable success at articulated object modeling in recent years [Fel zenszwalb and Huttenlocher] [2005 1. 



1973|, which have 



However, classical pictorial structures are not well-suited to semantic image labeling: they (1) parameterize 
object parts and abstract them completely from the pixel-level, (2) require all parts to be present in the scene, 
and (3) typically adopt simple relational models (linear springs). These three characteristics of the classical 
models make them unsuitable for image labeling problems. 

Our method, called pixel-support parts-sparse pictorial structures, or PS3, overcomes these limitations and 
takes a step towards a parts-based view of image understanding by proposing a joint global model over 
image parts — objects in the scene such as trees, cars, the road, etc. — which are each nodes in the pictorial 
structures graph. It directly ties each part to a set of pixels without any restrictive parameterization, which 
affords a rich set of object-level measurements, e.g., global shape. PS3 also defines a space of plausible 
part-graphs and learns complete relation models between the pairwise elements. At inference, a suitable 
part-graph is selected (in this paper, manually) and then optimized, which jointly localizes the parts at an 
object level and performs semantic labeling at the pixel level. 

We have tested our method on the MSRC and the SIFT-Flow benchmarks and demonstrate better perfor- 
mance with respect to maximum likelihood and Markov random field performance in a controlled experi- 
mental setting (exact same appearance models). We also compare our methods to existing semantic pixel 
labeling approaches, but do so with limited significant due to our assumption of being given the parts-graph 
for a test image. In the remainder of the paper, we present some related papers, then describe classical 
pictorial structures, our extensions including an appropriate inference algorithm, our experimental results, 
and conclusion and future work. 

2 Related Papers 

Several other recent papers have similarly demonstrated the significance of moving beyond local methods. 
[Hoiem et al.[ |2008J demonstrate the value of incorporating partial 3D information about the scene during 



detection. [Li-Jia et aLl|2009| take a hierarchical approach to full scene understanding by integrating patch- 



level, object-level, and textual tags into a generative model. These examples hold strong promise for scene 
understanding, but are not directly applicable to labeling. One promising method applicable to labeling is 



[Gould et aL}|2008[ , which proposes a relative location prior for each semantic class and model it with a 
conditional random field over all of the superpixels in an image. Whereas their approach defines a joint 
distribution over each of the superpixels in an image, which potentially remains too local, our approach 
defines it essentially in a layer above the superpixels, affording global coverage and the capability to also 
model the shape of each semantic image part. 

Another strategy has been to share information among different sub-problems in image understanding. The 
Cascaded Classification Models approach |HeitzetaLj|2008 1 shares information across object detection and 



geometrical reasoning. Yang et al. (2010) drive a pixel-labeling process by a bank of parts-based object 
detectors; their method demonstrates the power of explicitly modeling objects and their parts within the 
labeling process. 

The Layout Consistent CRF [Winn and Shotton[|2006J uses a parts-based representation of object categories 



to add robustness to partial occlusion and captures different types of local transitions between labels. Other 
methods look to hierarchies. [Ladicky et al.[|2009| propose an elegant hierarchical extension to the problem 



that currently performs best on the classic MSRC benchmark [Shotton et al.||2009| . However, in principle, it 



remains local and does not incorporate a notion of scene parts nor inter-relationship among the scene parts; 
indeed, nor do any of these prior methods. 

3 Classical Pictorial Structures 



Pictorial structures (PS) are a parts-based representation for objects in images. Classical PS models |Fis 



chler and Elschlagerj |1973J are in the class of undirected Markov graphical models. Concretely, pictorial 



2 



structures represent an object as a graph G = (V, E) in which each vertex Vi, i = 1, . . . , n is a /?ar£ in the 
n-part model and the edges E depict those parts that are connected. A configuration L = {Zi, . . . , l n } 
specifies a complete instance of the model, with each k specifying the parametric description of each part 
Vi. For example, in human pose estimation, each k can specifying the location, scale, and in-plane rotation 
of each body part. 

The best configuration for a given image / is specified as the one minimizing the following energy: 



arg mm 

L 



n 



(1) 



where the mi and potentials specify the unary and binary potentials, respectively, for parts k and lj, and 
9 specify model parameters. The specific form of these potential functions is arbitrary, but they are most 



commonly Gaussian functions (elegantly expressed in log-quadratic form in [Sap p et aL\ |2010[ ), which 
gives rise to a spring model interpretation. 

Such parts-based models have found success in the computer vision community for object recognition prob- 
lems. Firstly, pictorial structures are a general framework for parts-based modeling. For example, the 
constellation and star models [Fergus et al.[ |2005| |2007[ semi-supervisedly learns models for the parts of 
various object categories under specific topological arrangements. The framework has been extended to 
track objects in video [Kumar et al.[|2004] . More recently, in the context of human pose estimation, adap- 



tive appearance [Eichner and Ferrari 



2009| and adaptive pose prior [Sa pp et aTTl|2010J were introduced to 



enhance robustness in the presence of weak localization, appearance or other cues. For tracking objects in 
which some parts may be missing, the mixture-of-parts pictorial structure defines a distribution over legal 
part subsets and a mechanism for retrieving an appropriate structure [H ess et aL}|2007| [. Secondly, although 
the optimization is, in general, NP-hard, under certain conditions, such as a tree- structured graph [Felzen- 
szwalb and Huttenlocher[ |2005[, the global optimum can be reached efficiently. Thirdly, pictorial structures 
have a clear statistical interpretation in the form of a Gibbs distribution: 



n^I,0) = ^e^[-H{L\I,9)\ , 



(2) 



where Z(-) is the partition, or normalizing, function and H(-) is the energy function defined in ([!]). This 
statistical view permits principled estimation of the model parameters and globally convergent inference 
algorithms even in the case of general potentials. 

However, classical pictorial structures have significant limitations when applied to more general problems 
in which (1) some parts may be missing, (2) a distribution over structures is present rather than a single one, 
and (3) a precise segmentation of each part is required rather than strictly its parametric description. One 
such problem is semantic pixel labeling. In most images, only a few of the classes are present: e.g., four to 
five for the 21 class MSRC [Shotton et al.[|2009) . Furthermore, the standard parametric descriptions of the 
parts U do not readily map to pixel labels. 

4 The PS3 Model for Semantic Labeling 

We begin with a concrete problem definition for semantic scene labeling. Let A be the pixel lattice and 
define the basic elements A C A to be either individual pixels, patches, or superpixels, such that (J A = A 
and Ai P| A2 = 0. Let Z specify the set of semantic class labels, e.g., car, tree, etc., and denote z\ as the 
label for element A. In the maximum a posteriori view, the labeling problem is to associate the best label 
with each element 



{z x y = aigmaxP({z\}\I,0) , 



(3) 



3 



but we do not directly model the problem at the pixel level. Rather, we model it at the object level k as we 
now explain. 

Parts with Direct Pixel Support. We take a nonparametric approach and directly represent the part k based 
on its pixel support. Each part k comprises a set of basic elements {A^, A^ 2 ), . . . }, and induces a binary 
map, Bi\ A \-> {0,1}. A configuration L jointly represents a high-level description of the elements in a 
scene, and also a direct semantic labeling of each pixel in the image. Furthermore, rich, pixel-level descrip- 
tions of part-appearance and part-shape are now plausible. However, it adds significant complexity into the 
estimation problem: fast inference based on max-product message passing [Felzenszwalb and Huttenlocherl 



2005 1 is no longer a viable option as the parts have a more complex interdependent relation among their 
supports. 

Parts-Sparse Pictorial Structures. Classically, pictorial structures models are defined by a fixed set of n 
parts, and all are expected in the image. In scene labeling, however, most images contain a small subset of 
the possible labels Z. We consider the space ft containing all plausible pictorial structures for scene labels. 
ft is large, but finite: for an image of size w, the upper bound on nodes in a PS3 model is |w|, but the typical 
number is be quite smaller, e.g., around three to five for the MSRC dataset. Each node can be of one class 
type from Z. Whereas classical pictorial structures model the parameters 9 for a specific structure, in PS3, 
we model 9 in the unary and binary terms at an individual and pairwise level, independent of the structure. 
Then, for any plausible layout of parts, we can immediately index into their respective parameters and use 
them for inference. 

In this paper, we do not define an explicit form on how is distributed. Rather, we enumerate a plausible 
set of structures and tie one to each image, but in the general case, PS3 samples from ft. In spirit, this notion 
of parts-sparse pictorial structures has appeared in [Hess et aL} [2007]. Their mixture-of-parts pictorial 



structures model has similarly relaxed the assumption that the full set of parts needs to appear. One can 
indeed use this mixture distribution on ft. We further compare our approach to MoPPS in Section [6] 

Standard Form of PS3. The terms of the energy function underlying PS3 operate on functions of the parts 
(/)(•) and ^( ) rather than the parts directly. These functions are arbitrary and depend on how the potentials 
will be modeled (we specify exact definitions in the next section). The standard form of a PS 3 from ft is 

\i=l €ij J 

4.1 Model Details for Semantic Image Labeling 

Unary Term. The unary term will capture the appearance, shape and the location of the parts: 

m((j)(li)\9) =QLA™>A(k\@) + <— appearance 

OLsms(k\0) + <- shape 

^l^l(w)I^) ^—location (5) 

The a. are coefficients on each term. The (p function maps the pixel support part k to the pair (Zj, /^), where 
Hi is the centroid of If. fi{ = ^ J2\eh ^xgA x - The terms °f j^j) are described next. 
Appearance. We model appearance in four-dimensions: Lab color-space and texton space. The texton 
maps use a 61 -channel filter bank [ Varm a~and Zisserman| |2005| of combined Leung-Malik [Leung and 



Malik} |200T| and Schmid filters [Schmid|[2001[ , followed by a k-Means process with 64 cluster centers. No 



experimentation was performed to optimize the texton codebook filter and size choice. 

The appearance model for a particular class z is specified by a set of normalized Lab+texton histograms, 
h( p \p = 1, . . . , 4, in both the foreground h z and background Kq z . The background histograms are specific 
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Sheep Foreground 



Sheep Background 



Figure 1: Example of the narrowband used to capture a part-specific background region during learning 
and inference. Black regions represent "on" pixels for the sheep part. 



to class z and are modeled using the narrowband dU of pixels surrounding the foreground (see Figure [T] 
for an illustration). For a histogram distance D (we use intersection), which is applied in each dimension 
independently and summed, the ratio 



m A {li\0) 



l D(h z ,h dk )D{h dz ,h k ) 
4D(h z ,h h )D(h dz ,h dh ) 



(6) 



specifies our appearance potential. The numerator term measures the cross-fitness: how well the foreground 
histogram matches the background model and vice versa; and the denominator measures the actual fitness 
of the part k to the class histograms. The smaller the numerator and larger the denominator the better the 
overall fit and hence the lower the energy. 
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Figure 2: Visual rendering of some of the shape models. Each image shows the map Bi that is centered 
around the part centroid and normalized to a unit-box coordinate system. The images are rendered using 
the jet colormap in Matlab. The figure has been broken into objects (left) and stuff (right) to emphasize the 
disparity in expressiveness between the shape maps for the two part types. 

Shape. The capability to model global part shape is an key feature of the PS3 model. We model it nonpara- 
metrically using a kernel density estimator. For a pixel x, member of part k with centroid jii and class Zi, 
define its normalized coordinate with respect to its part, x = (x — /^) /w where w is a vector specifying the 
width and height of the image. The shape probability of the pixel is 



1 N 

P(x\0,Zi) = 



(7) 



where N is the number of samples for this shape from the training data, which have all been normalized with 
respect to the centroid of their constituent part and reference frame, and </?(•) is a windowing function that 
returns 1 if its argument is less than the size of a pixel in the normalized reference frame and otherwise. 
In practice, we quantize the density and store a discrete map of 201 x 201 normalized pixels; call this map 
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S Zi . Recall, a part k induces a binary membership map B{ at the (normalized) pixel level. Finally, the shape 
potential is defined as the mean shape probability over the part's constituent pixels: 



m s (k\9) = - log _||B i0 5^|| F + 
V w 



\(l-Bi)Q(l-S z 



(8) 



where is the element- wise product and the Frobenius norm is used. 

Example shape densities are shown in Figure[2j There is a clear distinction in the expressiveness of this shape 
model between the objects (e.g., airplane, face) and the stuff (e.g., sky, road). The maps for the stuff classes 
tend to be diffuse and indiscriminate, whereas the maps for the object classes are mostly recognizable. 
Section [6] shows that this object-level modeling significantly aids in labeling of the object-type classes. 

Location. We model the part location with a Gaussian distribution on its centroid For class z, denote 
the mean centroid location v z and the (full) covariance matrix T, z . The location potential is hence the 
Mahalanobis distance: 



m L (fjbi\6) = {m - z^) T X 1 (m 



(9) 



Figure [3] gives a few examples of the location potential. Note the contrast, in terms of objects and stuff, to 
the shape potential: the objects tend to be less informative in terms of location than the stuff. 
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Figure 3: Samples of the location potential rriL{Hi\Q) grouped again in terms of objects and stuff. 



Binary Term. The binary term will capture the relative distance and angle of pairwise connected parts: 



--a D d D (m, fjij\0) + 

OLRd R {m, fJLj\0) 



distance 
angle (10) 



The a. are again coefficients on each term. The ip function maps the pixel support part k to the is the 
centroid of k. More sophisticated and ip functions are plausible with the PS3 framework, but we do not 
explore them in this paper. 

Distance. The relative part location is captured simply by the distance between the parts (classical pictorial 
structures). For parts k and lj, we evaluate the distance Vij = — fij\\2 and model it by a Gaussian 
parameterized by (z^-, <j?-). The distance potential is do^Hi, fJ>j\0) = — ^ij) 2 /vfj • 

Angle. We model the relative angle between the two parts by a von Mises distribution, which is a symmetric 



and unimodal distribution on the circle [Berens 2009 1. Let denote the angle relating part i with respect 
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to part j. The von Mises distribution is a function of (J ZiZj9 the mean direction, and n ZiZ . the concentration 
parameter (similar to variance): 



exp [ KziZj ( 



Tin ~ LVz 



2irIo(K fZiZj ) 

where Jo(-) is a Bessel function. Finally, the angle potential is the negative log: 

d R (m, Hj\6) = -\ogP{rij\e,z h Zj) . 



(11) 



(12) 



Some examples are presented in Figure [4] These examples suggest this angle potential is jointly useful for 
the objects and the shape, especially how they inter-relate. For example, consider the rightmost plots of 
tree-given-cow: in the MSRC dataset, cows appear in (or on) pasture nearly always and there are often trees 
on the horizon. 
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tree given cow 




Figure 4: Samples of the angle distribution. 

4.2 Learning the PS3 Model 

For a training set of images {ii 5 . . . , In} and corresponding configurations {Li, . . . , L^}, which are essen- 
tially pixel- wise image labelings, learning the parameters is cast as a maximum likelihood (MLE) problem. 
[Felzenszwalb and Hut tenlocher, 2005] show that the parameters 9 on the unary potentials can be learned in- 
dependently, and this holds for our pixel-support parts. In our case, we do not seek to learn a tree- structured 
graph. Instead, we define a deterministic mapping from a label image, or configuration, Li to a graph G in 
the following manner: for each connected component in Li create a part in the graph. Two parts are adjacent 
in the graph if any pixel in their respective connected components are adjacent in the label image Li. It is 
our assumption that this general structure adds necessary descriptiveness for the labeling problem. Finally, 
for each pair of adjacent parts, we learn the parameters on the binary potentials via MLE. 

The last part of learning is to estimate the five a weights on the various potentials. It is widely known 
that estimating these weights is a significant problem as it requires estimation of the full partition function, 
which is intractable [Winkler, 2006]. However, in our case, the problem is compounded, even some stan- 
dard approximations like pseudo-likelihood are intractable because of the pixel-support nature of the parts. 
Because of these complexities, we simply set the coefficients such that the relative scale of each potential is 
normalized and finally we ensure the weights specify a convex combination over the potentials. 

5 Inference with Data- Adaptive MCMC 

Inference with the PS3 model has two main components: (1) determine the structure of the PS3 for the 
image at hand, and (2) determine the optimal configuration L* given a structure. In this paper, we study the 
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latter and leave the former for future work. Although this limits the generality of the contributions proposed 
herein to cases in which a suitable structure for the PS3 could be determined or given, we show that even the 
determination of the optimal configuration alone is a significant problem. Furthermore, direct extensions of 
the proposed methods present viable options for handling the structure inference, as we will discuss. The 
configuration inference problem is posed as 

L* = argmaxexp[-#(L|/,60] . (13) 

L 

The corresponding energy minimization problem is argmin L —H(L\I, 9). In general, this problem is NP- 
hard, but seems similar to the standard form for which our community has hotly studied approximate so- 
lutions over the past decade [Szeliski et al.[|2006| . However, as noted by [Felzenszwalb and Huttenlocherl 



2005|, the structure of the graph and the space of possible solutions differ substantially; in other words, the 
minimization problem cannot be cast as a local labeling problem. 

Consider the variables in question, L = {/i, . . . , l n }. We already know the class z% of each part and each k 
has a complex interrelationship to the other parts via its pixel support. For example, taking one element A 
away from li and moving it to lj has part-global effects on both l{ and lj in terms of appearance and shape, 
which differs quite drastically from these prior methods. One could consider defining the PS3 inference as 
a labeling problem over the elements {A} with each part k being a labeling index and associating a label 
variable, say £j, with each element Xj. However, inference would remain outside of the scope of these 
methods, again because a local change of one label variables £j would have a far-reaching affect on many 
other elements {\k : €j = £k}- 

In addition, classical pictorial structures use parametric representations of U, such as part-centroid, and for 
the typical spring-model case, define a Mahalonobis distance to capture the ideal relative location between 
parts. Casting our nonparametric form l{ into this framework would yield an intractable high-dimension 



problem: even though we rely on parametric functions of k for our binary potentials ( [10] ) no convenient 
form of the ideal location is possible since the k are tied directly to the pixel support. 

MCMC Sampler. We hence adopt a Metropolis-Hastings (MH) approach to handle the general inference 
[Andrieu et al.[ [2003 J. The MH sampler is straightforward and yet, guaranteed to (eventually) sample 



from the underlying invariant distribution P(L\I, 9) as it satisfies the detailed balance equation, even when 
P(L\I, 9) is known only up to a constant. Furthermore, the clique Gibbs form of P(L\I, 9) guarantees such 
an invariant distribution exists [Winkler, 2006]. MH is an iterative algorithm that walks a Markov chain 



through the state space according to the following acceptance probability 

ML® L>) = min/l ^l-H(L>\I,e)]Q(LM\L>) 1 

where is the chain configuration at time t, V is the proposed move, and Q(-) is the proposal distribution. 
We adopt a superpixel specification of the elements {A}, computed via [ |Felzenszwalb and Huttenlocher 



2004]. Each proposed move of the Markov chain acts by moving a superpixel from one part, say ^, to 

>ves are proposed according to t 

1 n ( 
Q{J]\L®)=—Y\WM 
n i=i \ 



another part, say lj. Such moves are proposed according to the following proposal distribution: 

~D(h dk ,hx) 



D(h h ,h x ) 

D(h k ,h x ) A 
D(h dh ,h x )\ ) 



+ 



(15) 



8 



where the A is the proposed element to change, S(-) is the normal Dirac delta, h\ is the histogram from 
element A, Z is the normalizing term, and we have overloaded the l\ notation to mean the part containing A 
in the current configuration. The proposal distribution has an intuitive explanation. First, we uniformly 
sample from each of the parts. Second, we sample the elements according to how well they would fit their 
new role with respect to the sampled part based on the ratio of foreground to background appearance fit if 
the element is currently outside of the part and vice versa. Although not represented in the equation for 
clarity, we only consider those elements touching the boundary of sampled part (both inside and outside). 



Although the chain is guaranteed to converge regardless of its initialization [Winkler, 2006], we initialize 



by assigning each superpixel A based on the ratio of its distance and appearance likelihood to each part 
in the PS 3 graph. 

Data-Adaptive Simulated Annealing. We embed the MH sampler into a simulated annealing process 
HGeman and Gemanl|1984[ because we seek the maximum a posteriori samples. Simulated annealing adds 



a temperature parameter T into the distribution, p( T ) (L\I, 9) = exp [— (L\I, 9)] , such that as T —> 
the p( T )(L|/, 9) distribution approaches the modes of P{L\I, 9). However, the theoretical guarantee exists 
on fairly restrictive bounds on the cooling schedule, the sequence of temperatures as the process is cooled 



[Andrieu et al.[|2003| . Furthermore, it is not well understood how to set the cooling schedule in practice, 
especially for very high-dimensional sample spaces, such as the one at hand. The challenge is that one 
proposal move V will change the density quite little resulting in acceptance probabilities near uniform 
unless the cooling schedule is tweaked just right. 

To resolve this issue, we propose an principled approach to set the cooling schedule that adapts to each 
image at hand and requires no manual tweaking. The basic idea is to directly estimate a map from desired 
acceptance probabilities to the required temperatures. Denote 7 as shorthand for the acceptance probability. 
Disregarding the proposal distribution, consider 7 written directly in terms of the amount p of energy the 
proposed move would make: 

_ exp [-ff (L') /T] _exp[-{H{LM)+p)/T] ^ 



exp [-H (L«) /T] exp [-H (L«) /T] 



For a specific desired 7 value and known p, we can solve ( 16) for T = — ^ making it possible to adapt 
the simulated annealing cooling schedule to each image in a principled manner, rather than manually tuning 
parameters by hand. Before beginning the annealing process, we sample P(L\I, 9) to estimate the p for the 
image. Assuming a linear gradient of desired acceptance ratios, the only part that needs to be manually set 
is the acceptance probability range, 71, 72, which we set to 0.9 and 0.1 respectively to cover most of the 
range of acceptance probabilities but never making them guaranteed or impossible. 

6 Results and Discussion 



We use two pixel-labeling benchmark datasets for our experimental analysis: MSRC [ Shotton et al. 2009 1 



and SIFT-Flow [Liu et al.[[2009| . In brief, MSRC is a 21-class 596-image dataset and SIFT-Flow is a 33- 



class 2688-image dataset, both of typical natural photos. The gold standard for these data is set by manual 
human labeling and most images have a large percentage of pixels actually labeled in the gold standard. In 
both cases, we use the posted training-testing splits for learning and evaluation; we note the split is 55% 
training for MSRC and 90% training for SIFT-Flow. Finally, in the posted split for SIFT-Flow, three classes 
(cow, desert, and moon) do not appear and are hence dropped, yielding a 30-class dataset in actuality. 

6.1 Comparisons to Baselines 

Our primary evaluation goal is to determine and quantify the benefit gained through the global parts-based 
structure. Hence we make a quantitative comparison of our method against an MLE classifier and an MRF, 
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Figure 5: Quantitative results on the two data sets, comparing the MLE classifier, the MRF model over the 
superpixel basic elements, and our proposed PS3. The table shows % pixel accuracy Nu / J2j ^ij f or differ- 



ent object classes. "Global" refers to the overall error 



while "average " is ^ 



£Z \Z\Y,jezNij- 



Nij refers to the number of pixels of label i labelled j. Note the color in the table columns is intended to 
serve as the legend for Figure^ 

with results shown in Figure [5] In all cases, we use the same appearance models and assume full knowledge 
of the graph structure (for the MLE and MRF methods, this means the subset of possible class labels) for 
each test image. In the MLE case, the basic elements are assumed independent, and in the MRF case, a local 
Potts smoothness model is used between the basic elements. We note that our proposed PS3 model is also 
an MRF, but over the global scene parts and not over the superpixel basic elements. 

We do not make a comparison to other pictorial structures papers as, to the best of our knowledge, no 
existing pictorial structures method can be directly applied to the pixel labeling problem. We also do not 



show a comparison against other methods' quantitative scores on these data sets, such as [Ladicky et al. 



2009 1 who currently have the highest score on MSRC, and we want to caution the reader against doing so. 
The assumption on knowing the graph structure that we have made limits the comparability of our proposed 
method against these others, and our appearance model is comparatively simpler. Our quantitative scores 
need to be interpreted as relative among the three methods we have displayed: in all three cases we have 
used the same appearance model (discussed earlier) allowing for a controlled experimental analysis in which 
the aspect varied is how the basic elements (superpixels) are related in the overall model. In this setting, it is 
clearly demonstrated that the proposed model outperforms both the superpixel-independent MLE classifier 
and the locally connected MRF model. 

Our proposed method performs best in global and average per-class labeling accuracy over the pixel- 
independent MLE and local-MRF methods on both datasets. On MSRC we see a gain of 2% in global 
and 3% in average accuracy. These are not significant numbers, overall, but we note the significant improve- 
ment in two subsets of the classes. First, in classes with high intraclass variance, such as building, we see a 
30+% increase. Second, in classes with strong global object shape, such as airplane, we see a 20% increase. 
These exhibit the merits the global modeling of PS3 brings to the problem. The reason why the overall gain 
is not too much is that the dominant classes, such as sky, grass, and so on, have a strong visual character in 
the relatively small MSRC data set that is already easily modeled at the local level. 

We find a different case in the SIFT-Flow dataset, which is much larger and contains more intra-class vari- 
ance even for these dominant classes. In the SIFT-Flow cases, a larger increase of 9% in global and 12% in 
average accuracy is observed. We bring note to the marked improvement in some of the categories corre- 
sponding to things, such as airplane, car, door, and person. We explain this improvement as being due to the 
added modeling richness in the parts-based representation: things in the image benefit from the rich global 
description through the shape and part-context. We also note the few categories in SIFT-Flow where PS3 
was outperformed by the MLE and MRF methods (bridge, crosswalk, fence, and sun). In these cases, the 
typical object foreground is sparse and the global part description is insufficient to accommodate the weak 
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Figure 6: Visual results on the two data sets. Each columns shows an example in three rows: (1) original 
image, (2) human gold standard, and (3) our PS3 result overlaid upon the image. We have also rendered 
the graph structure on top of the image. The color legend is given in Figure^ The results on the right side 
of the figure show some of the worst examples of our performance. 

local cues, which get overtaken by the other nearby classes. Examples of this phenomena (as well as good 
cases) are given in Figure [6] 

6.2 Comparisons to State of the Art 



We also make a quantitative (Figure|7]) against a range of papers from the state of the art, TextonBoost [Shot- 
ton et al.||2009| , Mean-Shift Patches [Yang et al.[ [20071 , Graph-Shifts [Corso et al.[ [20081 , TextonForests 
[ [Shotton etaLj [20081 , and Hierarchical CRF (H-CRF) [Ladicky et a^[2009) . Nearly all of these papers can 
be classes within the "local" labeling realm. The state-of-the-art H-CRF approach in [Ladicky et al.| |20091 
makes a clever extension to define a hierarchy of random fields that has shown great potential to overcome 
the limitations of purely local labeling methods. However, it still defines the labeling problem based directly 
on local interactions of label variables rather than on object level interactions, as we do in PS3. None of the 
existing pictorial structures papers we are aware of can be directly applied to semantic image labeling and 
are hence not compared here. 

Our proposed PS3 method performs best in average per-class labeling accuracy (78%) and shows marked 
improvement in numerous classes, such as flower, bird, chair, etc. We make careful note that although the 
table directly compares PS3 to the other literature, we assume the graph structure for each testing image is 
known; notwithstanding this point, we do feel it is important to demonstrate the comparative performance 
against the state of the art. Furthermore, we note that our unary potentials are comparatively simpler (i.e., 
color and texton histograms) to those in many of the other methods. Finally, knowing the appropriate graph 
for the image does not immediately solve the problem: an MLE assignment of superpixels to elements in 
the graph yields global accuracy of 74% and average accuracy of 70% — i.e., the PS3 model is indeed adding 
power to the problem. 

Separating Objects from Stuff. The respective merits of the two top performing approaches, namely 
H-CRF and ours, PS3, become immediately evident when inspecting how the methods compare on various 
classes, as we discuss next. As we mentioned earlier, one can group the parts roughly into two types: objects 
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(cow, sheep, aeroplane, face, car, bicycle, flower, sign, bird, chair, road, cat, dog, body, and boat) and stuff 
(building, grass, tree, sky, water and road) 
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Figure 7: Quantitative results on the MSRC data set, in the same format as [Ladick y et al. 2009]/ for easy 
comparison. The table shows % pixel accuracy Nu/J2j ^ij f or different object classes. "Global" refers 



to the overall error 
label i labelled j. 



, while "average" is \z\Y^ U N- ■ • ^ij refers to the number of pixels of 
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Figure 8: Quantitative results when we group objects and stuff on the MSRC data set. The scores in this table 
are all average accuracy derived from the per-class accuracy scores from Figure^ no further processing 
was performed. See text for discussion and the list of classes for the two groups. 



The objects tend to be small and articulated and have high location variance, whereas the stuff tends to be 



relatively stable in terms of location and appearance distribution. As we showed in Section [47TJ the shape 
distributions of the stuff are uninformative, but for the objects, they are quite informative. We have claimed 
that a key merit of our method is that it allow the modeler to emphasize global object shape and relationship 
to the scene in general. This claim is clearly substantiated when looking at the comparative average accuracy 
of the objects to the stuff in Figures|7]and[8j We explain it via the components of the PS3 model as follows: 
our method performs at about the average performance for the stuff classes, which are comparatively easier 
to infer using location and appearance. Subsequently, these stuff classes are grounded and drive the object 
classes during inference allowing them to utilize the objects' richer shape and angle potentials. 

6.3 Methodological Comparisons 



Comparison to DDMCMC | |Tu and ZhuH2002| . The seminal DDMCMC work laid the groundwork for 



our approach to inference in this paper, but the underlying problem and model are quite different. Firstly, the 
DDMCMC work is an approach to the low-level image segmentation problem. No notion of object class or 
global object is incorporated into that work, which, as our quantitative results demonstrated is a significant 
merit of our proposed approach. Secondly, it is primarily seeking samples of image segmentations under 
all plausible partitions of the image. We have restricted ours to the set of superpixels, but we can plausibly 
relax our assumption. Lastly, their work did not seem to seek the modes, whereas we propose a data-adaptive 
method for mode seeking in the MCMC framework. 



Comparison to Mixture-of-Parts Pictorial Structures (MoPPS) [Hess et al., 2007 1. As far as we know, 



MoPPS is the first and only pictorial structures extension to permit part subsets. Like our method, they 
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permit a space of plausible pictorial structures. Then, the MoPPS method carefully specifies mixture dis- 
tribution over parts, a set of legal part configurations and a mechanism for returning a classical pictorial 
structure given a part subset. In the spirit of sparse-parts pictorial structures, PS3 is similar to MoPPS. But 
MoPPS remains restricted to the object modeling case (or in the paper, a nice extention to highly articulated 
football team players as an object). For a given parts subset, the MoPPS structure is classical (Gaussian 
spring model) whereas our part potential incorporate a more rich set of relations. 

7 Conclusion 

We have presented the pixel- support, parts-sparse pictorial structures, or PS3 model. PS3 makes a step in 
scene labeling by moving beyond the de facto local and region based approaches to full semantic scene la- 
beling and into a rich object-level approach that remains directly tied to the pixel level. As such, PS3 unifies 
parts-based object models and scene-based labeling models in one common methodology. Our experimental 
comparisons demonstrate the merits in moving beyond the restrictive local methods in a number of settings 
on benchmark data sets (MSRC and Sift Flow). 

PS3 has, perhaps, opened more problems than it has solved, however. For example, we have assumed 
that the graph for an image is known during inference. For general applicability, this assumption needs 
to be relaxed. Extensions of the proposed MCMC methods into jump-diffusion dynamics [Tu and Zhu 
2002J are plausible, but some approximations or other methods to marginalize the full sample-space are 
also plausible. Probabilistic ontologies and Markov logic present two potential avenues for this problem. 
Similarly, we have demonstrated that the parameter estimation problem in the PS 3 is more complex given 
the global-local part-pixel dependency. We are not aware of a principled tractable method for estimating 
these parameters. Finally, we have observed a big disparity in the respective strength of our various model 
terms for object- and stuff-type classes, but we have not incorporated this distinction into the model itself. 
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